Author: Madalina-Alina Racovita, 1st-year master's student in Computational Optimization at the Faculty of Computer Science, UAIC, Iasi, Romania

The way in which the value of a house is currently set in the real estate industry is not necessarily statistically robust. A person in the real estate market is in one of two positions: seller or buyer. Regardless of which position a given person is in, it is a fact that they may undervalue or overestimate a property by reasoning subjectively.
Given the hundreds of property sales that occur in any given period, it makes sense to evaluate and analyze the sales from multiple points of view to establish a more accurate value for a house with a specific set of characteristics. Regression analysis is a statistical approach to modeling the relationship between one or more independent (explanatory) variables, the characteristics of a house, and a dependent variable, the value or selling price of the house. Existing models have shown that regression analysis is a viable way to produce a better estimate of the true value of a house. Classification can also help in distinguishing between different types of property.
This notebook carries out the exploratory data analysis, feature selection and feature engineering required for building a regression / classification model that predicts the value of a dependent variable (for instance, the sale price of a house in the case of regression, or the type of the property in a classification context) as accurately as possible by minimizing a cost function.
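As a minimal illustration of the regression idea described above, a linear model can be fit to house characteristics. This sketch uses synthetic data with made-up coefficients, not the notebook's dataset:

```python
# Illustrative sketch: regressing a synthetic sale price on two
# house characteristics (living area and number of bedrooms).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
living_sqft = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n)
# Synthetic "true" price: 150 $/sqft + 10k per bedroom + noise
price = 150 * living_sqft + 10_000 * bedrooms + rng.normal(0, 20_000, n)

X = np.column_stack([living_sqft, bedrooms])
model = LinearRegression().fit(X, price)
print("R^2 on training data:", model.score(X, price))
```

The fitted coefficients approximately recover the per-sqft and per-bedroom contributions, which is the same mechanism the notebook aims to exploit on real assessor data.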

!pip install missingno
!pip install folium
!pip install ipympl
!pip install plotly
# !pip install projection-pursuit==0.4
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import randomcolor
import missingno as msno
import folium
import warnings
import random
import plotly.express as px
import umap
from folium.plugins import HeatMap
from scipy.stats import kurtosis
from scipy.stats import skew
from mpl_toolkits import mplot3d
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from numpy import linalg as LA
from scipy import stats
warnings.filterwarnings('ignore')
matplotlib.style.use('ggplot')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
os.listdir('./Data')
df_rcon1 = pd.read_csv("./Data/RCON_12011.assessor.tsv", sep = "\t")
df_rcon1.shape
df_rcon2 = pd.read_csv("./Data/RCON_53033.assessor.tsv", sep = "\t")
df_rcon2.shape
We are going to merge these two dataframes concerning RCON real estate properties for the purpose of an overview analysis.
df_rcon = pd.concat([df_rcon1, df_rcon2])
df_rcon.shape
print("Number of observations in RCON dataset: ", df_rcon.shape[0])
print("Number of predictors in RCON dataset: ", df_rcon.shape[1] - 1)
df_rsfr1 = pd.read_csv("./Data/RSFR_12011.assessor.tsv", sep = "\t")
df_rsfr1.shape
df_rsfr2 = pd.read_csv("./Data/RSFR_53033.assessor.tsv", sep = "\t")
df_rsfr2.shape
As with the RCON dataframe, the same step is performed for the RSFR real estate dataframe: the df_rsfr1 and df_rsfr2 dataframes are merged.
df_rsfr = pd.concat([df_rsfr1, df_rsfr2])
print("Number of observations in RSFR dataset: ", df_rsfr.shape[0])
print("Number of predictors in RSFR dataset: ", df_rsfr.shape[1] - 1)
df_rcon.info()
df_rcon.head()
set(df_rcon['TotalRooms'])
df_rcon.describe()
print("Number of numerical predictors RCON: ", len(list(df_rcon.describe())))
df_rsfr.head()
df_rsfr.describe()
filtered_data = msno.nullity_filter(df_rcon, filter='bottom', p=0.999) # or filter='top'
msno.matrix(filtered_data.sample(300), color=[0.3, 0.3, 0.4])
filtered_data = msno.nullity_filter(df_rcon, filter='bottom', p=0.999)
msno.heatmap(filtered_data)
def get_columns_with_missing_values():
    column_with_missing_values = []
    for colname in list(df_rcon):
        missing_values_current_column_rcon = df_rcon[colname].isnull().sum()
        missing_values_current_column_rsfr = df_rsfr[colname].isnull().sum()
        if missing_values_current_column_rcon != 0:
            column_with_missing_values.append((colname + "_RCON", missing_values_current_column_rcon,
                                               str(missing_values_current_column_rcon / df_rcon.shape[0] * 100) + "%"))
        if missing_values_current_column_rsfr != 0:
            column_with_missing_values.append((colname + "_RSFR", missing_values_current_column_rsfr,
                                               str(missing_values_current_column_rsfr / df_rsfr.shape[0] * 100) + "%"))
    return column_with_missing_values
def highlight(row, colors_dict):
    # even rows get their own color; odd rows reuse the previous even row's color
    if int(row.name) in colors_dict:
        return ['background-color: ' + colors_dict[int(row.name)]] * 3
    return ['background-color: ' + colors_dict[int(row.name) - 1]] * 3
def random_color():
    return 'rgba' + str(tuple(np.random.randint(256, size=3)) + (0.2,))
missing_values_df = pd.DataFrame(get_columns_with_missing_values(),
columns=['Predictor\'s name', 'No. of missing values', 'Percentage of absence'])
rand_color = randomcolor.RandomColor()
random_colors = {id: random_color() for id in range(0, len(list(missing_values_df.index)),2)}
missing_values_df.style.apply(highlight, colors_dict=random_colors, axis=1)
The columns with a missing-value percentage greater than 95% in both the RCON and RSFR datasets are going to be deleted, since features that are almost entirely missing cannot carry much predictive power when building any type of machine learning model.
column_miss_perc_ge_95 = ['DeedType', 'RoofCode', 'BuildingShapeCode', 'City', 'StructureCode']
for column in column_miss_perc_ge_95:
    del df_rcon[column]
    del df_rsfr[column]
other_columns = ['BuildingCode', 'GarageCarportCode', 'PatioPorchCode', 'PoolCode', 'Zonning', 'PropTaxAmount', 'FoundationCode', 'ExteriorCode', 'CoolingCode', 'HeatingCode',
'HeatingSourceCode', 'View', 'DocType', 'TransType', 'DistressCode', 'SellPrice']
df_rsfr[other_columns].dtypes
object_miss_values_features = []
numeric_miss_valuez_features = []
types = df_rsfr[other_columns].dtypes
for i in range(len(types)):
    if types[i] == object:
        object_miss_values_features.append(other_columns[i])
    else:
        numeric_miss_valuez_features.append(other_columns[i])
for column in object_miss_values_features:
    print(column + " ___________ " + str(set([obs for obs in list(df_rsfr[column]) if pd.isnull(obs) == False])))
    print()
The missing values for object-type columns are going to be replaced with empty strings. For the numeric variables, missing values are going to be replaced with 0, except for SellPrice, whose missing values are going to be replaced with LastSalePrice. The other numeric variables, even though they are numeric, appear to be discrete, since they are codes representing a category.
numeric_miss_valuez_features
for column in object_miss_values_features:
    df_rcon[column] = df_rcon[column].fillna('')
    df_rsfr[column] = df_rsfr[column].fillna('')
rcon_medium_sellprice = df_rcon['SellPrice'].mean()
rsfr_medium_sellprice = df_rsfr['SellPrice'].mean()
print("The mean price for an RCON property is = ", rcon_medium_sellprice)
print("The mean price for an RSFR property is = ", rsfr_medium_sellprice)
for column in numeric_miss_valuez_features:
    if column != 'SellPrice':
        df_rcon[column] = df_rcon[column].fillna(0)
        df_rsfr[column] = df_rsfr[column].fillna(0)
Both dataframes are going to be concatenated into a bigger dataframe, with an additional column called isRCON. Its value is going to be 0 if the property belongs to the RSFR dataframe, and 1 if it belongs to the RCON dataframe.
df_rcon['isRCON'] = 1
df_rsfr['isRCON'] = 0
df = pd.concat([df_rcon, df_rsfr])
prices = []
for (index, row) in df.iterrows():
    if pd.isnull(row['SellPrice']):
        row['SellPrice'] = row['LastSalePrice']
    if row['SellPrice'] == 0:
        if row['isRCON'] == 1:
            prices.append(rcon_medium_sellprice)
        else:
            prices.append(rsfr_medium_sellprice)
    else:
        prices.append(row['SellPrice'])
df['SellPrice'] = prices
rcon_medium_sellprice
df.loc[(df['isRCON'] == 1) & (df['LastSalePrice'] == 0), 'LastSalePrice'] = rcon_medium_sellprice
df.loc[(df['isRCON'] == 0) & (df['LastSalePrice'] == 0), 'LastSalePrice'] = rsfr_medium_sellprice
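The row-by-row imputation above can also be expressed with vectorized pandas operations. This is a hedged sketch on a toy frame whose column names mirror the notebook's but whose numbers and group means are made up:

```python
# Vectorized sketch of the same logic: missing SellPrice falls back to
# LastSalePrice, and remaining zeros fall back to a per-group mean.
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'SellPrice':     [np.nan, 0.0, 250_000.0, np.nan],
    'LastSalePrice': [100_000.0, 0.0, 240_000.0, 0.0],
    'isRCON':        [1, 0, 1, 0],
})
rcon_mean, rsfr_mean = 300_000.0, 200_000.0  # placeholder group means

toy['SellPrice'] = toy['SellPrice'].fillna(toy['LastSalePrice'])
group_mean = np.where(toy['isRCON'] == 1, rcon_mean, rsfr_mean)
toy['SellPrice'] = np.where(toy['SellPrice'] == 0, group_mean, toy['SellPrice'])
print(toy['SellPrice'].tolist())
```

Avoiding iterrows() matters mostly for speed: on frames of this size (hundreds of thousands of rows), the vectorized form runs orders of magnitude faster.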
bool_type_columns = df.select_dtypes(include=bool).columns.tolist()
bool_type_columns
for column in bool_type_columns:
    df[column] = df[column].apply(lambda x: 0 if x == False else 1)
print("Number of numerical predictors: ", len(list(df.describe())))
char_type_columns = df.select_dtypes(include=object).columns.tolist()
char_type_columns
df[df.select_dtypes(include=object).columns.tolist()].head()
def highlight_constant_features(row):
    if row['Number of different values RCON'] == 1 and row['Number of different values RSFR'] == 1:
        return ['background-color: ' + 'rgba(255, 102, 102, 0.6)'] * 3
    else:
        return ['background-color: ' + 'rgba(255, 255, 255, 0.18)'] * 3
columns = list(df_rcon)
value_counts = pd.DataFrame({"Predictor\'s Name": columns,
"Number of different values RCON": [len(list(df_rcon[column].value_counts())) for column in columns],
"Number of different values RSFR": [len(list(df_rsfr[column].value_counts())) for column in columns]})
value_counts.style.apply(highlight_constant_features, axis=1)
columns_constant_features = ['AtticSqft', 'IsFixer', 'GarageNoOfCars', 'EffectiveYearBuilt', 'TotalRooms']
for current in columns_constant_features:
    del df[current]
Features which are semantically similar are going to be grouped into the same category in a global dictionary, to make their analysis easier.
features_grouped_dict = {
'code_related': ['CountyFipsCode',
'BuildingCode',
'ConditionCode',
'QualityCode',
'GarageCarportCode',
'PatioPorchCode',
'PoolCode',
'ConstructionCode',
'FoundationCode',
'ExteriorCode',
'CoolingCode',
'HeatingCode',
'HeatingSourceCode',
'DistressCode',
'Condition'],
'dimension_related': ['LandSqft',
'LivingSqft',
'GarageSqft',
'BasementSqft',
'BasementFinishedSqft'],
'number_of': ['StructureNbr',
'Bedrooms',
'TotalBaths',
'FirePlaces',
'Stories',
'UnitsInBuilding',
],
'bool_type': ['HasPatioPorch',
'HasPool',
'IsWaterfront',
'ArmsLengthFlag',
'OwnerOccupied'],
'quality_related': ['Quality',
'View'
],
'location_related': ['Zip',
'Latitude',
'Longitude',
'State',
'Zonning'],
'time_related': ['StatusDate',
'LastSaleDate',
'YearBuilt',
'AssessedYear',
'SellDate'],
'type_related': ['DocType',
'TransType'],
'price_related_regression': ['SellPrice',
'LastSalePrice',
'LandValue',
'ImprovementValue',
'TotalValue',
'PropTaxAmount'],
'classification_related': ['isRCON']
}
features_grouped_dict['price_related_regression']
df[features_grouped_dict['price_related_regression']].head()
columns = list(features_grouped_dict['price_related_regression'])
value_counts = pd.DataFrame({"Predictor\'s Name": columns,
"Number of different values RCON": [len(list(df_rcon[column].value_counts())) for column in columns],
"Number of different values RSFR": [len(list(df_rsfr[column].value_counts())) for column in columns]})
value_counts.style.apply(highlight_constant_features, axis=1)
df['SellPrice'].mean()
df['SellPrice'].median()
df['SellPrice'].mode()
As can be deduced from the above, most of the houses have a price of $382,558, which is equal to the mean selling price of the RSFR houses; this was expected after replacing the missing values with the mean.
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
ax.hist([list(df['SellPrice']),
list(df[df['isRCON'] == 1]['SellPrice']),
list(df[df['isRCON'] == 0]['SellPrice'])],
bins = 25, stacked=True, fill=True,
alpha = 0.37, color=['green', 'purple', 'red'])
ax.set_title('Distribution of the houses depending on the selling price')
ax.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax.set_ylabel('Frequency')
ax.set_xlabel('SellPrice')
plt.show()
Since the percentage of missing values for the output variable SellPrice was greater than 30% for both RSFR and RCON properties (replaced in a previous preprocessing step with the mean), I am going to redo the stacked histogram to explore the distribution of the original prices in the dataset.
rcon_medium_sellprice
rsfr_medium_sellprice
fig, ax0 = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
mean_sellprice_rcon = df['SellPrice'].mean()
ax0.hist([list(df[(df['SellPrice'] != rcon_medium_sellprice) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice']),
list(df[(df['isRCON'] == 1) & (df['SellPrice'] != rcon_medium_sellprice)]['SellPrice']),
list(df[(df['isRCON'] == 0) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice'])],
bins = 20, fill=True, alpha=0.7,
linewidth=2, histtype='bar',
color=['#ffb366', '#6666ff', '#ff66b3'])
ax0.set_title('Distribution of the houses depending on the selling price \n The houses that had initially missing values were excluded', fontsize=10)
ax0.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax0.set_ylabel('Frequency')
ax0.set_xlabel('SellPrice')
plt.show()
It seems that most of the houses have a sale price smaller than $2,500,000, in both the RCON and RSFR dataframes. I am going to verify how many houses have a price greater than this amount.
str(len(df_rsfr[df_rsfr['SellPrice'] > 2500000]) / len(df_rsfr) * 100) + "% for RSFR"
str(len(df_rcon[df_rcon['SellPrice'] > 2500000]) / len(df_rcon) * 100) + "% for RCON"
I am going to drop the observations whose price is greater than $2,500,000, since few observations meet this condition. Both datasets have enough values that some can be dropped, for the purpose of constructing a more robust regression/classification model (the model should not be built on particular cases but on generally observable patterns).
df = df.drop(df[df['SellPrice'] > 2500000].index)
I am going to rebuild the histograms to see the distribution of prices among the real estate properties in the samples.
fig, ax0 = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))
mean_sellprice_rcon = df['SellPrice'].mean()
ax0.hist([list(df[(df['SellPrice'] != rcon_medium_sellprice) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice']),
list(df[(df['isRCON'] == 1) & (df['SellPrice'] != rcon_medium_sellprice)]['SellPrice']),
list(df[(df['isRCON'] == 0) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice'])],
bins = 20, fill=True, alpha=0.7,
linewidth=2, histtype='bar',
color=['#ffb366', '#6666ff', '#ff66b3'])
ax0.set_title('Distribution of the houses depending on the selling price \n The houses that had initially missing values were excluded', fontsize=10)
ax0.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax0.set_ylabel('Frequency')
ax0.set_xlabel('SellPrice')
plt.show()
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))
ax.hist([list(df['SellPrice']),
list(df[df['isRCON'] == 1]['SellPrice']),
list(df[df['isRCON'] == 0]['SellPrice'])],
bins = 25, stacked=True, fill=True,
alpha = 0.37, color=['green', 'purple', 'red'])
ax.set_title('Distribution of the houses depending on the selling price')
ax.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax.set_ylabel('Frequency')
ax.set_xlabel('SellPrice')
plt.show()
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))
sns.distplot(df['SellPrice'], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RCON + RSFR')
sns.distplot(df[df['isRCON'] == 1]['SellPrice'], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RCON')
sns.distplot(df[df['isRCON'] == 0]['SellPrice'], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RSFR')
plt.title('Density plot SellPrice')
plt.show()

print(skew(df['SellPrice']))
print(skew(df[df['isRCON'] == 1]['SellPrice']))
print(skew(df[df['isRCON'] == 0]['SellPrice']))

print(kurtosis(df['SellPrice']))
print(kurtosis(df[df['isRCON'] == 1]['SellPrice']))
print(kurtosis(df[df['isRCON'] == 0]['SellPrice']))
All three distributions are leptokurtic, which can also be noticed from the density plot presented in the previous section.
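As a quick illustration of what leptokurtic means here: scipy's kurtosis() reports excess kurtosis, i.e. 0 for a normal distribution, so positive values indicate heavier tails and a sharper peak. The comparison below uses synthetic samples, not the dataset:

```python
# Compare a normal sample (excess kurtosis ~ 0) with a heavy-tailed
# Laplace sample (theoretical excess kurtosis = 3, i.e. leptokurtic).
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=100_000)
laplace_sample = rng.laplace(size=100_000)

print("normal : skew=%.2f kurtosis=%.2f" % (skew(normal_sample), kurtosis(normal_sample)))
print("laplace: skew=%.2f kurtosis=%.2f" % (skew(laplace_sample), kurtosis(laplace_sample)))
```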
continuous_variables = features_grouped_dict['price_related_regression'][1:-1]
fig, ax = plt.subplots(2, 2, figsize=(15, 10))
for i in range(len(continuous_variables)):
    plt.subplot(2, 2, i + 1)
    plt.scatter(df[continuous_variables[i]], df['SellPrice'], c=df['isRCON'], alpha=0.2, marker="1")
    plt.xlabel(continuous_variables[i], fontsize=12)
    plt.ylabel('SellPrice', fontsize=12)
    plt.xlim(-1e02, 1e6)
plt.show()
plt.figure(figsize=(15, 7))
plt.scatter(df['PropTaxAmount'], df['SellPrice'], c=df['isRCON'], alpha=0.2, marker="1")
plt.xlabel('PropTaxAmount', fontsize=12)
plt.ylabel('SellPrice', fontsize=12)
plt.xlim(-1e02, 7e4)
plt.show()
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 5))
sns.distplot(df['PropTaxAmount'], hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RSFR')
features_grouped_dict['time_related']
At first glance it seems that SellDate is identical to LastSaleDate, with the caveat that LastSaleDate has a different format: YYYY-MM-DD HH:mm:ss. The 'HH:mm:ss' part is irrelevant in the context of real estate sales, since we are only interested in the day. I am going to verify the equivalence in the next sections:
df['LastSaleDate'] = df['LastSaleDate'].apply(lambda x: str(x.split()[0]))
df['SellDate'] = df['SellDate'].astype(str)
df['LastSaleDate'].head()
list(df['LastSaleDate']) == list(df['SellDate']) # are these two equivalent??
def see_percentage_of_dif_LastSaleDate_SellDate(df, dataset_name):
    no_differences = 0
    total = len(list(df['LastSaleDate']))
    a = list(df['LastSaleDate'])
    b = list(df['SellDate'])
    for index in range(total):
        if a[index] != b[index]:
            no_differences += 1
    print(str(no_differences / total * 100) + "% " + dataset_name)
see_percentage_of_dif_LastSaleDate_SellDate(df[df['isRCON'] == 1], "RCON")
see_percentage_of_dif_LastSaleDate_SellDate(df[df['isRCON'] == 0], "RSFR")
Since in the RSFR dataset LastSaleDate is completely different from SellDate, both columns are going to be kept.
df['SellDate_Year'] = df['SellDate'].apply(lambda x: int(x[:4]))
df['SellDate_Month'] = df['SellDate'].apply(lambda x: int(x[5:7]))
df['SellDate_Day'] = df['SellDate'].apply(lambda x: int(x[8:]))
df[['SellDate_Year', 'SellDate_Month', 'SellDate_Day', 'SellDate']].head()
del df['SellDate']
features_grouped_dict['time_related'].append('SellDate_Year')
features_grouped_dict['time_related'].append('SellDate_Month')
features_grouped_dict['time_related'].append('SellDate_Day')
features_grouped_dict['time_related'].remove('SellDate')
features_grouped_dict['time_related']
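The string slicing used above to build SellDate_Year/Month/Day assumes a fixed YYYY-MM-DD layout. A more robust alternative, sketched here on a toy series rather than the notebook's dataframe, is pandas' datetime parsing with the .dt accessor:

```python
# Decompose dates with pd.to_datetime and the .dt accessor instead of
# fixed-position string slicing (tolerates format variations).
import pandas as pd

dates = pd.Series(['2019-06-15', '2020-12-01'])
parsed = pd.to_datetime(dates)

year, month, day = parsed.dt.year, parsed.dt.month, parsed.dt.day
print(year.tolist(), month.tolist(), day.tolist())
```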
df[['StatusDate', 'LastSaleDate', 'YearBuilt', 'SellDate_Year', 'SellDate_Month']].head(10)
df['StatusDate'] = df['StatusDate'].apply(lambda x: str(x.split()[0]))
str(len(set(list(df['StatusDate'])))) + " different values for StatusDate"
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
ax0, ax1, ax2, ax3 = axes.flatten()
ax0.hist([list(df['YearBuilt']),
list(df[df['isRCON'] == 1]['YearBuilt']),
list(df[df['isRCON'] == 0]['YearBuilt'])],
bins=10, histtype='bar')
ax0.set_title('Distribution of the houses depending on YearBuild', fontsize=10)
ax0.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax0.set_ylabel('Frequency')
ax0.set_xlabel('YearBuilt')
# ------------------- SellDate_Day ----------------------------
ax1.hist([list(df['SellDate_Day']),
list(df[df['isRCON'] == 1]['SellDate_Day']),
list(df[df['isRCON'] == 0]['SellDate_Day'])],
bins=31, histtype='step', stacked=True, fill=False,
linewidth=5, alpha=0.8, color=['#96ceb4', '#ff6f69', '#ffeead'])
ax1.set_title('Distribution of the houses depending on the day in which are sold', fontsize=10)
ax1.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax1.set_ylabel('Frequency')
ax1.set_xlabel('Day in which are sold')
# ------------------- SellDate_Month ----------------------------
ax2.hist([list(df['SellDate_Month']),
list(df[df['isRCON'] == 1]['SellDate_Month']),
list(df[df['isRCON'] == 0]['SellDate_Month'])],
bins=12, alpha=0.4, histtype='bar', stacked=True,
color=['#ffcc66', '#0033cc', '#006666'],
linestyle='dashed', lw=3)
ax2.set_title('Distribution of the houses depending on the month in which are sold', fontsize=10)
ax2.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax2.set_ylabel('Frequency')
ax2.set_xlabel('Month in which are sold')
# ------------------- SellDate_Year ----------------------------
ax3.hist([list(df['SellDate_Year']),
list(df[df['isRCON'] == 1]['SellDate_Year']),
list(df[df['isRCON'] == 0]['SellDate_Year'])],
bins=15, histtype='bar', color=['#ff9900', '#0033cc', '#00cc99'])
ax3.set_title('Distribution of the houses depending on the year in which are sold', fontsize=10)
ax3.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax3.set_ylabel('Frequency')
ax3.set_xlabel('Year in which are sold')
plt.show()
The distributions of the number of houses sold in a given year do not seem to be identical for the RCON vs. RSFR datasets. For the RCON dataset, the distribution seems closer to a Gaussian; for the RSFR dataset, it is closer to a uniform distribution.
fig, ax = plt.subplots(figsize=(17, 5))
sns.boxplot(df['SellDate_Day'], df['SellPrice'])
plt.show()
fig, ax = plt.subplots(figsize=(5, 8))
sns.boxplot(df['SellDate_Month'], df['SellPrice'])
plt.show()
From the above boxplots, it can be observed that in many cases real estate sales show a seasonality: many of the houses sold during a year, regardless of the day or month, are valued at less than $500,000.
Many outliers can be observed, but their distribution seems at first glance to be uniform over a year or a month, meaning that on any given day or month mostly cheaper properties are sold, but more expensive houses are sold as well.
Having too many features in a model is not always a good thing, because it might cause overfitting and worse results when we want to predict values for a new dataset. Thus, if a feature does not improve the model much, not adding it may be the better choice.
# df_only_numerical = df[df.columns[~df.columns.isin(char_type_columns)]]
matplotlib.style.use('ggplot')
f, ax = plt.subplots(figsize=(14, 14))
plt.imshow(df.corr(), cmap="viridis", interpolation='nearest')
plt.title('Pearson Correlation among features')
plt.colorbar()
tick_marks = [i for i in range(len(df.columns))]
plt.xticks(tick_marks, df.columns, rotation='vertical')
plt.yticks(tick_marks, df.columns)
plt.show()
Another important consideration is correlation. If there is very high correlation between two features, keeping both of them is usually not a good idea, as it may cause overfitting. For instance, we may remove CountyFipsCode or Zip because they are highly correlated. This relation can be anticipated by looking at the definitions in the dataset, but to be sure, the correlation matrix should be checked. However, this does not mean that one of two highly correlated features must always be removed. For example, Zip and FirePlaces may be highly correlated, but the relationship between them is not of the same nature as the relationship between Zip and CountyFipsCode.
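One way to act on this is to list the feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below uses a toy frame and an arbitrary 0.9 cutoff, not the notebook's dataframe:

```python
# List feature pairs above a correlation threshold by scanning the
# upper triangle of the absolute correlation matrix.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.01, size=500),  # near-duplicate of 'a'
                    'c': rng.normal(size=500)})

corr = toy.corr().abs()
# keep only the strict upper triangle to avoid self- and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for r in upper.index for c in upper.columns
         if upper.loc[r, c] > 0.9]
print(pairs)
```

Only one of each flagged pair would then be a candidate for removal, subject to the caveat above about the nature of the relationship.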
The correlation matrix shows how correlated the variables are with each other and, more relevantly, with the dependent variable, the sale price.
corr = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(30, 16))
sns.heatmap(df.corr(), annot=True, fmt=".2f", mask=mask, cmap='magma')
plt.show()
Chi-square test of independence of variables in a contingency table.
This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed.
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency([df['SellPrice'], df['LastSalePrice']], correction=False)
chi2, p
chi2, p, dof, ex = chi2_contingency([df['SellPrice'], df['YearBuilt']], correction=False)
chi2, p
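Note that chi2_contingency expects a table of observed counts, whereas the calls above pass raw value columns. As a hedged illustration of the intended usage, here is the test on a made-up 2x2 contingency table (the counts are invented, not from the dataset):

```python
# Chi-square test of independence on an actual contingency table.
from scipy.stats import chi2_contingency

# Rows: has pool / no pool; columns: RCON / RSFR (illustrative counts)
table = [[40, 60],
         [110, 290]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)
```

A small p-value here would reject the null hypothesis that the row and column categories are independent.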
Calculate the T-test for the means of two independent samples of scores.
This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
from scipy.stats import ttest_ind
stats.ttest_ind(df['SellPrice'], df['LastSalePrice']) # returns t-statistic and p-value
features_grouped_dict['location_related']
df['State'].value_counts()
Since the real estate properties covered in the dataframe belong to two different states, it is expected that when plotting the latitude and longitude coordinates on a map, they will be grouped into two different clusters.
plt.title('Localization of the real estates', fontsize=12)
plt.scatter(df['Latitude'], df['Longitude'], c=df['SellPrice'], alpha=0.2, marker="1")
plt.xlabel('Latitude', fontsize=12)
plt.ylabel('Longitude', fontsize=12)
plt.show()
As can be seen from the previous graphic, the real estate properties covered in the considered sample are split into two groups:
plt.figure(figsize=(20, 20))
plt.title('Abstract reconstruction of the map: Sold Houses Regions by price: Latitude in [45, 50] and Longitude apx -121')
plt.scatter(x=df[df['Latitude'] > 30]['Latitude'],
y=df[df['Latitude'] > 30]['Longitude'],
c=df[df['Latitude'] > 30]['SellPrice'],
alpha=0.7)
plt.xlabel('Latitude', fontsize=12); plt.ylabel('Longitude', fontsize=12)
plt.show();
plt.figure(figsize=(20, 20))
plt.title('Abstract reconstruction of the map: Sold Houses Regions by price: Latitude in [25, 30] and Longitude apx -80')
plt.scatter(x=df[df['Latitude'] < 30]['Latitude'],
y=df[df['Latitude'] < 30]['Longitude'],
c=df[df['Latitude'] < 30]['SellPrice'],
alpha=0.8)
plt.xlabel('Latitude', fontsize=12); plt.ylabel('Longitude', fontsize=12)
plt.show();
fig, ax = plt.subplots(1,2,figsize=(30,25))
plt.subplot(2,2,1)
plt.title('Location for RCON/RSFR houses: \n Latitude in [25, 30] and Longitude apx -80', fontsize=10)
plt.scatter(x=df[df['Latitude'] < 30]['Latitude'],
y=df[df['Latitude'] < 30]['Longitude'],
c=df[df['Latitude'] < 30]['isRCON'],
alpha=0.4)
plt.xlabel('Latitude', fontsize=12)
plt.ylabel('Longitude', fontsize=12)
plt.subplot(2,2,2)
plt.title('Location for RCON/RSFR houses: \n Latitude in [45, 50] and Longitude apx -121', fontsize=10)
plt.scatter(x=df[df['Latitude'] > 30]['Latitude'],
y=df[df['Latitude'] > 30]['Longitude'],
c=df[df['Latitude'] > 30]['isRCON'],
alpha=0.4)
plt.xlabel('Latitude', fontsize=12)
plt.ylabel('Longitude', fontsize=12)
plt.show();
As can be seen from the above representations, RCON and RSFR properties are mixed in both states. It is difficult to identify a large area in which only RCON or only RSFR properties are sold exclusively.
base_map = folium.Map(location=[26.15, -80.25], control_scale=True, zoom_start=9.4)
df_copy = df[df['Latitude'] < 30].copy()
df_copy['Count'] = 1 # to see how many houses have been sold in a specific area
# add CartoDB Positron tile layer
folium.TileLayer('cartodbpositron').add_to(base_map)
# add heatmap
HeatMap(data=df_copy[['Latitude','Longitude','Count']].groupby(['Latitude','Longitude']).sum().reset_index().values.tolist(),
radius = 8, max_zoom = 13,name = 'Heat Map').add_to(base_map)
folium.LayerControl(collapsed=True).add_to(base_map)
base_map
base_map = folium.Map(location=[47.5, -121.8], control_scale=True, zoom_start=9.4)
df_copy = df[df['Latitude'] > 30].copy()
df_copy['Count'] = 1 # to see how many houses have been sold in a specific area
# add CartoDB Positron tile layer
folium.TileLayer('cartodbpositron').add_to(base_map)
# add heatmap
HeatMap(data=df_copy[['Latitude','Longitude','Count']].groupby(['Latitude','Longitude']).sum().reset_index().values.tolist(),
radius = 8, max_zoom = 13,name = 'Heat Map').add_to(base_map)
folium.LayerControl(collapsed=True).add_to(base_map)
base_map
From the above maps, it can be observed that the real estate properties in the dataframes are grouped into two clusters: one situated on the East coast of the USA, near Miami, and another on the North-West coast. The number of properties sold is greater in the center of the clusters than in the suburbs.
len(df['Zonning'].value_counts().index)
df['Zonning'].head()
The Zonning feature seems to represent a code specific to the property's location.
fig = plt.figure(figsize=(15,15))
df_copy = df[df['Latitude'] < 30].copy()
fig = px.scatter_3d(df_copy,
x='Latitude',
y='Longitude',
z='SellPrice',
color='SellPrice',
opacity=0.9,
size='SellPrice',
title='3d price distribution based on geographical coordinates Latitude < 30')
fig.show()
def get_3d_graph_popular_zones_for_selling(df, color='#5e384b'):
    df_copy = df.copy()
    df_copy['Count'] = 1
    df_copy['Latitude_rounded'] = round(df_copy['Latitude'], 2)
    df_copy['Longitude_rounded'] = round(df_copy['Longitude'], 2)
    no_houses = df_copy[['Latitude_rounded', 'Longitude_rounded', 'Count']]\
        .groupby(['Latitude_rounded', 'Longitude_rounded'])\
        .sum().reset_index().values.tolist()
    no_houses = [tuple(current) for current in no_houses]
    no_houses = list(zip(*no_houses))
    dict_occ = {str(elem): 0 for elem in set(zip(no_houses[0], no_houses[1]))}
    for i in range(len(no_houses[0])):
        key = str(tuple([no_houses[0][i], no_houses[1][i]]))
        dict_occ[key] = int(no_houses[2][i])
    z = list(dict_occ.values())
    x = []
    y = []
    for key in dict_occ.keys():
        key_tuple = eval(key)
        x.append(key_tuple[0])
        y.append(key_tuple[1])
    fig = plt.figure(figsize=(15, 10))
    ax1 = fig.add_subplot(111, projection='3d')
    num_elements = len(x)
    dx = [0.007 for i in range(num_elements)]
    dy = [0.007 for i in range(num_elements)]
    ax1.bar3d(x=x, y=y, z=[0 for i in range(num_elements)], dx=dx, dy=dy, dz=z, color=color)
    plt.title('Zones with the biggest number of houses for selling')
    ax1.set_xlabel('Latitude rounded')
    ax1.set_ylabel('Longitude rounded')
    ax1.set_zlabel('Number of houses')
    plt.show()
get_3d_graph_popular_zones_for_selling(df[df['Latitude'] < 30].copy())
get_3d_graph_popular_zones_for_selling(df[(df['Latitude'] > 30) & (df['Longitude'] < -121.8)].copy(), color = '#d59c8f')
columns = features_grouped_dict['code_related']
value_counts = pd.DataFrame({"Predictor\'s Name": columns,
"Number of different values RCON": [len(list(df[df['isRCON']==1][column].value_counts())) for column in columns],
"Number of different values RSFR": [len(list(df[df['isRCON']==0][column].value_counts())) for column in columns]})
value_counts
h = df[features_grouped_dict['code_related']].hist(bins=15, figsize=(9,15),
xlabelsize='10',
ylabelsize='10',
color='#74d2b3')
sns.despine(left=True, bottom=True)
[x.title.set_size(10) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()];
features_grouped_dict['dimension_related']
df['SellPricePerLivingSqft'] = df['SellPrice'] / df['LivingSqft']
features_grouped_dict['price_related_regression'].append('SellPricePerLivingSqft')
df['SellPricePerLivingSqft'].head()
fig, ax = plt.subplots(1,1,figsize=(7,7))
plt.scatter(df[df['SellPricePerLivingSqft']< 6000]['SellPricePerLivingSqft'],
df[df['SellPricePerLivingSqft']< 6000]['SellPrice'],
c=df[df['SellPricePerLivingSqft']< 6000]['SellPricePerLivingSqft'],
alpha=0.2, marker="1")
plt.xlabel('SellPricePerLivingSqft', fontsize=12)
plt.ylabel('SellPrice', fontsize=12)
plt.show()
pearsonr(df['SellPricePerLivingSqft'], df['SellPrice'])
There is also a slight positive correlation between the 'SellPrice' of a house and the 'SellPricePerLivingSqft' variable, which is intuitive: if the selling price is higher, the SellPricePerLivingSqft is quite likely to be higher as well.
aux = features_grouped_dict['dimension_related']
aux.append('SellPricePerLivingSqft')
sns.pairplot(df[aux], kind="scatter", palette='husl')
plt.show()
There are slight tendencies of positive correlation between groups of variables such as 'BasementSqft' and 'BasementFinishedSqft', or 'BasementSqft' and 'GarageSqft'. It can also be noticed that when 'LandSqft' is larger, 'SellPricePerLivingSqft' tends to increase as well.
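These pairwise tendencies can be quantified with a correlation matrix: `DataFrame.corr` computes the Pearson correlation for every pair of numeric columns. A sketch on a toy frame with deliberately correlated basement columns (the column names are reused for illustration, the values are synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy frame mimicking the dimension-related columns
basement = rng.normal(1000, 200, size=100)
toy = pd.DataFrame({
    'BasementSqft': basement,
    'BasementFinishedSqft': 0.7 * basement + rng.normal(0, 50, size=100),
    'GarageSqft': rng.normal(400, 80, size=100),
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()
```

The strongly linked basement columns should show a correlation close to 1, while the independent garage column should stay near 0 against both.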
fig, ax = plt.subplots(2,2,figsize=(15,15))
continuous_variables = ['LandSqft', 'LivingSqft','GarageSqft','BasementSqft']
for i in range(len(continuous_variables)):
plt.subplot(2,2,i+1)
plt.scatter(df[continuous_variables[i]], df['SellPrice'], c=df['SellPrice'], alpha=0.2, marker="1")
plt.xlabel(continuous_variables[i], fontsize=12)
plt.ylabel('SellPrice', fontsize=12)
plt.show()
plt.boxplot(df['LivingSqft'])
plt.show()
It seems that most of the real estate properties have a LivingSqft smaller than 5000; the observations above this threshold are outliers. Since they represent a small percentage of the entire dataset, I am going to drop them.
str(len(list(df[df['LivingSqft']>5000]['LivingSqft'].index)) / len(df['LivingSqft']) * 100) + '% from the entire df'
df = df.drop(df[df['LivingSqft']>5000].index)
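Instead of an eyeballed cutoff such as 5000, an interquartile-range rule (Tukey's fences) can derive the threshold from the data itself. A small sketch on a toy column (the values are illustrative, not the real LivingSqft data):

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Tukey's rule: values outside [Q1 - k*IQR, Q3 + k*IQR] count as outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy LivingSqft-like column with one extreme value
s = pd.Series([1200, 1500, 1800, 2100, 2400, 12000])
low, high = iqr_bounds(s)
filtered = s[(s >= low) & (s <= high)]   # the 12000 observation is dropped
```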
def plot_distribution_sqft(df, column_name):
fig, ax = plt.subplots(1,2,figsize=(15,15))
plt.subplot(1,2,1)
plt.boxplot(df[column_name])
plt.title(column_name + ' boxplot')
plt.subplot(1,2,2)
sns.distplot(df[column_name], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RCON + RSFR')
sns.distplot(df[df['isRCON'] == 1][column_name], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RCON')
sns.distplot(df[df['isRCON'] == 0][column_name], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 3},
label = 'Distribution RSFR')
plt.title('Distribution plot ' + column_name)
plt.legend()
plt.show()
plot_distribution_sqft(df, 'LivingSqft')
plt.boxplot(df['LandSqft'])
plt.show()
str(len(list(df[df['LandSqft']>500000]['LandSqft'].index)) / len(df['LandSqft']) * 100) + '% from the entire df'
Similarly, I drop the outliers for the 'LandSqft' feature.
df = df.drop(df[df['LandSqft']>500000].index)
plot_distribution_sqft(df, 'LandSqft')
plot_distribution_sqft(df[df['LandSqft']<20000], 'LandSqft')
df[features_grouped_dict['number_of']].head()
plt.figure(figsize=(7, 4))
sns.distplot(df['TotalBaths'])
plt.title('Baths distribution')
plt.show()
df[df['SellPricePerLivingSqft']<4000].boxplot(column = ['SellPricePerLivingSqft'], by='Bedrooms', figsize=(7,10))
plt.show()
df[df['SellPricePerLivingSqft']<4000].boxplot(column = ['SellPrice'], by='Bedrooms', figsize=(7,10))
plt.show()
categorical_columns = ['Condition', 'Quality', 'HasPatioPorch', 'HasPool']
fig, ax = plt.subplots(2,2,figsize=(13,9))
for i in range(len(categorical_columns)):
plt.subplot(2,2,i+1)
sns.boxplot(x=df[categorical_columns[i]], y=df['SellPrice'])
plt.show()
Principal Component Analysis is a dimension-reduction technique whose purpose is to reduce redundancy in data. Datasets often contain 20, 30, or more variables, yet they may actually measure fewer underlying factors (often called "latent traits") than the number of measurements.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
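The description above can be made concrete: after centring the data, diagonalising its covariance matrix yields the principal directions (eigenvectors) and their explained variances (eigenvalues). A minimal NumPy sketch on synthetic correlated features (not the house data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated features plus one independent noisy one
base = rng.normal(size=(200, 1))
X_toy = np.hstack([base,
                   2 * base + 0.1 * rng.normal(size=(200, 1)),
                   rng.normal(size=(200, 1))])

# Centre the data, then diagonalise the covariance matrix
Xc = X_toy - X_toy.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores = projections onto the principal directions; columns are uncorrelated
scores = Xc @ eigvecs
explained_ratio = eigvals / eigvals.sum()
```

Because the first two toy features are almost collinear, the first component should capture most of the total variance, mirroring what `sklearn.decomposition.PCA` reports below.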
char_type_columns = df.select_dtypes(include=object).columns.tolist()
columns_that_dont_need_standardisation = char_type_columns.copy()
columns_that_dont_need_standardisation.append('SellPrice')
features = [x for x in list(df) if x not in columns_that_dont_need_standardisation]
features
X = df[features].copy()
y = df['SellPrice']
# In general a good idea is to scale the data
X = stats.zscore(X)
normalised_df = pd.DataFrame(X,columns=features)
normalised_df.tail()
pca = PCA()
pca_houses = pca.fit_transform(X)
pca_houses.shape[1]
principal_houses_df = pd.DataFrame(data = pca_houses,
columns = ['PC' + str(i) for i in range(1, len(list(normalised_df))+1, 1)])
principal_houses_df['SellPrice'] = list(df['SellPrice'])
principal_houses_df.tail()
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
plt.figure(figsize=(10,4))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), height=pca.explained_variance_ratio_, color='#c5a84e')
plt.title('Percentage of variance explained - PCA')
plt.show()
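A common follow-up to this scree plot is the cumulative explained variance, used to pick the smallest number of components that crosses a chosen threshold. A sketch on synthetic data, with 90% as an illustrative cutoff:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 10))
X_demo[:, 1] = X_demo[:, 0] + 0.05 * rng.normal(size=300)   # one nearly redundant column

pca_demo = PCA().fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
# smallest number of components whose cumulative ratio reaches 90%
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)
```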
fig = plt.figure(figsize=(15,15))
plt.scatter(pca_houses[:,0], pca_houses[:,1], c = y)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
def biplot(score,coeff,ids=[1, 2], labels=None):
xs = score[:,0]
ys = score[:,1]
n = coeff.shape[0]
scalex = 1.0/(xs.max() - xs.min())
scaley = 1.0/(ys.max() - ys.min())
fig = plt.figure(figsize=(25,25))
plt.scatter(xs * scalex,ys * scaley, c = y)
for i in range(n):
plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
if labels is None:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
else:
plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
plt.xlim(-0.4, 0.5)
plt.ylim(-0.35,0.55)
plt.xlabel("PrincipalComponent{}".format(ids[0]))
plt.ylabel("PrincipalComponent{}".format(ids[1]))
plt.grid()
plt.show()
biplot(pca_houses[:,:2], np.transpose(pca.components_[:2, :]),ids=[1, 2], labels=features)
biplot(pca_houses[:,1:3], np.transpose(pca.components_[1:3, :]),ids=[2, 3], labels=features)
ax = plt.figure(figsize=(13,13)).add_subplot(111, projection='3d')
ax.scatter(
xs=principal_houses_df['PC1'],
ys=principal_houses_df['PC2'],
zs=principal_houses_df['PC3'],
c=principal_houses_df['SellPrice'],
cmap='tab10'
)
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
plt.show()
from sklearn.manifold import TSNE
tsne = TSNE()
X_embedded = tsne.fit_transform(X[:10000])
fig = plt.figure(figsize=(15, 15))
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y[:10000], size=y[:10000], legend='brief', palette="Set2")
plt.show()
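For reference, a self-contained t-SNE run on two synthetic clusters shows the expected input and output shapes; `perplexity` must stay below the number of samples, and `init='pca'` usually stabilises the layout (toy data, not the standardised house features):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated toy clusters of 5-dimensional points
X_small = np.vstack([rng.normal(0, 1, size=(100, 5)),
                     rng.normal(8, 1, size=(100, 5))])

# Embed the 5-dimensional points into 2 dimensions for plotting
emb = TSNE(n_components=2, perplexity=30, init='pca', random_state=0).fit_transform(X_small)
```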
df.to_csv('processed_df.csv')
normalised_df.to_csv('normalised_df.csv')
# code in the R notebook
It is clear that when predicting the SellPrice of a house it is necessary to look for features with rich predictive power.
As noticed in this notebook, features related to the tax amount and the sell price per living sqft are correlated with the current sell price. Predictors related to quality or dimensions, or those that count items in the house, can increase the accuracy of a regression model, which is quite intuitive.
The geographical location can also be an important aspect. The code-related features will be used as a replacement for categorical / string features when building a regression model; this avoids the step of transforming the categorical columns into numerical ones with different encoding techniques.
From a classification point of view, the significant features are largely the same, with the note that the RSFR houses are more expensive than the RCON ones.